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Abstract 

Background: Genome-wide time-series data provide a ricli set of information for discovering gene regulatory 
relationships. As genome-wide data for mammalian systems are being generated, it is critical to develop network 
inference methods that can handle tens of thousands of genes efficiently, provide a systematic framework for the 
integration of multiple data sources, and yield robust, accurate and compact gene-to-gene relationships. 

Results: We developed and applied ScanBMA, a Bayesian inference method that incorporates external information 
to improve the accuracy of the inferred network. In particular, we developed a new strategy to efficiently search the 
model space, applied data transformations to reduce the effect of spurious relationships, and adopted the ^-prior to 
guide the search for candidate regulators. Our method is highly computationally efficient, thus addressing the 
scalability issue with network inference. The method is implemented as the ScanBMA function in the networkBMA 
Bioconductor software package. 

Conclusions: We compared ScanBMA to other popular methods using time series yeast data as well as time-series 
simulated data from the DREAM competition. We found that ScanBMA produced more compact networks with a 
greater proportion of true positives than the competing methods. Specifically, ScanBMA generally produced more 
favorable areas under the Receiver-Operating Characteristic and Precision-Recall curves than other regression-based 
methods and mutual-information based methods. In addition, ScanBMA is competitive with other network inference 
methods in terms of running time. 
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Background 

Identifying gene regulatory networks is an important 
problem in biology. There have recently been many 
advances in this area in terms of tools for collecting 
and analyzing large-scale genomics data. Many of these 
datasets, from microarrays and next generation sequenc- 
ing, quantify the expression levels of all genes in a 
given genome. Genome-wide time-series data, in prin- 
ciple, allow reverse engineering of the gene regulatory 
relationships by studying the temporal patterns of regu- 
lators and target genes. However, this can be a difficult 
problem due to the large number of genes {i.e. variables) 
being measured, which typically far exceeds the number 
of observations. Also, the number of actual regulators for 
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a particular gene is only a small fraction of the number of 
possible regulators. 

A popular method for inferring gene regulatory net- 
works from time series data uses Dynamic Bayesian Net- 
works (DBN) [1-5]. DEN methods estimate a probabilistic 
graphical model, given the time-series data. DBN meth- 
ods work well, but the network size that they can handle 
in practice is limited because of their computational cost. 

Ordinary differential equations (ODE) are alternative 
methods for constructing networks [6,7]. These methods 
are deterministic rather than statistical, although ODE 
methods can be combined with statistical methods. DBN 
on local networks within a larger ODE model inference 
method have been used, for example [8]. 

Another class of methods is based on regression models 
in which parent nodes (regulators) are inferred for each 
target gene. Vector autoregressive models have been pro- 
posed for inferring causal links between genes. Often this 
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takes the form of a model selection problem, and meth- 
ods such as the Least Absolute Shrinkage and Selection 
Operator (LASSO) [9,10], elastic net [11,12], and Bayesian 
model averaging (BMA) [13,14] have been used [15-19]. 
Morrissey, et al. [20] implemented a Markov Chain Monte 
Carlo (MCMC) sampler for a fully Bayesian formulation 
of the autoregressive model. 

Mutual information methods have been used exten- 
sively on genetics data [21-24], but usually for steady-state 
rather than time-series data. These methods are typi- 
cally non-directional. Recently, mutual information meth- 
ods have been extended to analyze time-series data and 
produce directed networks [25,26]. Mutual information 
methods have the advantage of being able to identify 
nonlinear relationships. 

Our contributions 

We present a new approach using Bayesian Model Aver- 
aging (BMA) for variable selection from time- series gene 
expression data. Our new method offers the following 
advances over our previous work [18,19]: 

• We develop a new algorithm called ScanBMA that 
searches the model space more efficiently and 
thoroughly than previous algorithms. It is much 
faster than previous implementations of BMA for a 
large number of predictors, resulting in running time 
comparable to that of LASSO. It allows inference for 
networks of thousands of genes to be completed in 
minutes on a standard laptop computer. 

• We transform the time-series data to reduce spurious 
correlations. Specifically, we remove the effect of a 
gene on itself by subtracting the mean expression 
level for each gene at each time point and then using 
the residuals from a regression of its expression at the 
current time point on its expression at the previous 
time point. 

• We use Zellner's g-prior [27] for the regression 
parameters and show that using the g-prior to 
compute posterior probabilities out-performs our 
previous effort using the Bayesian Information 
Criterion (BIC). 

• We address the scalability of network inference 
methods. Our new implementation produces running 
times of minutes compared to hours or even days for 
some competing methods, thus offering substantial 
improvements. 

We also carried out extensive empirical studies of our 
new method. Specifically, we compared our new method, 
ScanBMA, to other network construction methods in the 
literature, namely LASSO, the mutual information meth- 
ods MRNET (Maximum Relevance/Minimum Redun- 
dancy) [24], CLR (Context Likelihood or Relatedness) 
[23] and ARACNE (Algorithm for the Reconstruction 



of Accurate Cellular Networks) [22], and also Dynamic 
Bayesian Networks (DBN) when the latter were computa- 
tionally feasible. 

We benchmarked the performance of our approach, 
ScanBMA, using two datasets. The first dataset measures 
the gene expression levels over time of 97 yeast segregants 
perturbed with the drug rapamycin. The second dataset 
consists of simulated time-series data from the DREAM4 
(Dialogue for Reverse Engineering Assessment and Meth- 
ods) challenge. For the yeast dataset, we found that our 
method outperformed competitors and previous analyses 
in recovering regulatory relationships previously reported 
in the literature. For the DRE AM4 data, for which no prior 
information was available, our method performed com- 
parably to other methods, while producing more com- 
pact networks. Finally, the ScanBMA algorithm presents 
a substantial improvement in running time over previous 
implementations of BMA. The method is implemented as 
the ScanBMA function in the networkBMA Bioconductor 
software package. 

Results and discussion 

Method outline 

In ScanBMA, network inference is formulated as a series 
of variable selection problems in which parent nodes 
(regulators) are inferred for each target gene. The BMA 
framework accounts for model uncertainty in variable 
selection by averaging over the posterior distributions 
from multiple models, weighted by their posterior model 
probabilities [13,14]. A challenge of BMA is to efficiently 
select a set of models to be averaged over. ScanBMA uses 
a greedy approach to explore the model space and uses 
the Occam's window principle [28] to eliminate unlikely 
models. 

We previously developed a supervised framework to 
integrate external data sources, including co-expression, 
genome-wide binding, sequence polymorphism, physical 
interaction, genetic interaction, and literature curation 
data [18]. Using a training set consisting of approximately 
500 known regulatory relationships in the literature, we 
computed prior probabilities of regulatory relationships 
across all candidate genes and regulators. These prior 
probabilities were then used to compute the posterior 
probabilities of of candidate regression models. We used 
Zellner s ^-prior [27] to specify the prior for the model 
parameters in ScanBMA. We developed an expectation- 
maximization (EM) algorithm to estimate the prior vari- 
ance parameter g. 

Before the regression step, we apply a univariate 
measure (such as R-squared or BIC) to rank candi- 
date regulators for each target gene using these prior 
probabilities of regulatory relationships. The parameter 
nvar controls the number of top regulators used in the 
regression step of each target gene. We have performed 
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empirical studies to study the effect of and estimate the 
optimal nvar. 

Assessment 

A number of metrics have been used to evaluate the 
quality of inferred networks. We focus on a few that com- 
pare the inferred network with a gold-standard network 
of true edges. One measure that we use is the preci- 
sion of the inferred network, equal to the number of true 
positives divided by the total number of edges in the 
inferred network. Precision is important to researchers 
because an experiment to verify relationships identified 
in exploratory work can be expensive. Thus, the more 
confident we can be when identifying relationships, the 
better. In light of this, we also look at the area under the 
precision-recall curve (AUPRC). This gives a more com- 
prehensive view of network quality and does not require 
that a threshold be chosen for the posterior probability 
of an edge or for the number of edges included. We also 
look at the area under the ROC curve (AUROC), which is 
widely used to assess the quality of networks. 

Due to incomplete knowledge in real data, we use a par- 
tial assessment based on the YEASTRACT database [29]. 
This is a literature-curated repository of regulatory rela- 
tionships between known transcription factors and target 
genes in yeast, based on more than 1,300 literature refer- 
ences. Due to incomplete knowledge in yeast biology, the 
lack of an edge in YEASTRACT is not hard evidence of 
the absence of a relationship between two genes, although 
it is used as such in our evaluation. In contrast, the true 
underlying networks from the simulated DREAM4 data 
are known, so that an absence of a true edge is a false 
negative and a presence of a non-existing edge is a false 
positive. 

Results: yeast time series data 

Table 1 summarizes the assessment results for different 
methods applied to the yeast dataset. Our new ScanBMA 
method with nvar — 20 had a precision of 0.39, much 



higher than any other method, including our previous 
method, iBMA [19]. The area under the ROC curve 
(AUROC) for ScanBMA was also much better than for 
the competing methods. Note that for random guessing, 
the expected AUROC is 0.5 and the expected AUPRC is 

0. 038. For these data, the mutual information methods 
CLR and MRNET produced very large networks — nearly 
half of the number of possible edges, which is not concor- 
dant with what is known about regulatory networks of this 
type. 

Table 2 compares the preferred version of ScanBMA 
with four other versions, each of which lacks one of the 
components of our final method. This table shows that 
each component contributes to the accuracy of our final 
network. As shown by [19], the incorporation of the infor- 
mative prior yields the largest increase in performance, 
but the other components also contribute. The use of 
the ^-prior reduces the number of false positives, while 
the data transformation substantially reduces the size 
of the inferred network. 

The precision-recall curves for the different methods on 
the yeast data are shown in Figure 1. This figure shows 
the quality of the ScanBMA network even beyond the 95% 
posterior inclusion probability cutoff. The precision stays 
high through a large range of recall, whereas for the other 
methods it quickly drops to the level of random guessing. 
The figure also shows that nvar = 20 performs better than 
nvar = 3556. 

When analyzing gene networks where the number of 
nodes is in the thousands, computation time can be an 
important consideration. We compared ScanBMA with 
the other methods on the yeast data by running each 
method on 20 target genes under controlled conditions to 
find the average cpu time per gene. 

Table 3 shows that ScanBMA with nvar = 3556, 

1. e. considering all other genes whose expression varied as 
potential regulators, is within a factor of 3 of LASSO, the 
fastest of the other methods. Some of the mutual informa- 
tion methods, on the other hand, are much slower, with 



Table 1 Performance of different methods on the yeast data 


Method 


Precision 


AUROC 


AUPRC 


TP 


FP 


LASSO 


0.046 


0.506 


0.0416 


996 


20,469 


ARACNE 


0.205 


0.502 


0.0399 


69 


268 


CLR 


0.039 


0.510 


0.0435 


8,879 


220,942 


MRNET 


0.039 


0.513 


0.0442 


8,737 


214,757 


ScanBMA^^o] 


0.391 


0.601 


0.0747 


227 


353 


ScanBMAt3556] 


0.274 


0.629 


0.0740 


127 


336 


iBMAfioo] 


0.180 


0.517 


0.0788 


593 


2,702 



AUROC is the area under the ROC curve, AUPRC is the area under the precision-recall curve, and TP and FP are the numbers of true positive and false positive edges 
inferred, respectively. Thus TP+FP is the number of edges in the inferred network and Precision = TP/(TP+FP). ScanBMA was applied to the transformed data using 
the informative edge prior and Zellner's g-prior for the model parameters. The superscript indicates the value of nvar. Expected precision and AUPRC from random 
guessing is 0.0380. 
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Table 2 Performance of different versions of ScanBMA on the yeast data, either on the original scale (Orig) or 
transformed (Trans) 



Data 


nvar 


Priors 


Precision 


AUROC 


AUPRC 


TP 


FP 


Trans 


20 


g, inform 


0.391 


0.601 


0.0747 


227 


353 


Trans 


3556 


g, inform 


0.274 


0.629 


0.0740 


127 


336 


Trans 


20 


BIC, inform 


0.244 


0.590 


0.0616 


200 


619 


Orig 


20 


g, inform 


0.274 


0.586 


0.0680 


552 


1460 


Trans 


20 


g, Guelzim 


0.175 


0.499 


0.0395 


34 


160 



For the priors, inform refers to the informative prior, while Guelzim refers to the prior probability of 2.76/6000 for all possible relationships. 



Figure 2. Figure 3 shows the actual networks returned by 
each method in comparison with the true network. The 
ScanBMA network resembles the true network fairly well 
In particular, the compactness of the ScanBMA network 
is apparent, particularly when compared with LASSO and 
MRNET. The small number of false positives may be use- 
ful in focusing the attention of the biologist on edges of 
high interest when searching for new regulatory relation- 
ships. 

The results of the methods for the DREAM4 100-gene 
networks are summarized in Table 5. For these networks, 
the mutual information methods MRNET and CLR per- 
formed best in terms of the area under curve measures. 
ScanBMA was not quite as good by these measures, but 
its precision was much higher than that of any other 
method. Figure 4 illustrates the precision-recall curves for 
the various methods, showing increased precision across 
a broader range for ScanBMA. 

Conclusions 

We have presented a Bayesian Model Averaging method 
for inferring gene regulatory networks from time series 
data. It incorporates external information in a principled 
way via the prior edge probabilities, transforms the data 



MRNET taking about 50 times longer than ScanBMA. 
Table 3 also shows that ScanBMA produces a substantial 
improvement in computational efficiency over our pre- 
vious method IBM A [19], especially when nvar is large. 
Dynamic Bayesian network methods were not included in 
the comparison because they analyze the entire network 
at once and do not scale well enough to be run feasibly on 
large network datasets such as the yeast data. 

Results: simulated DREAM4 data 

Table 4 summarizes the results of the competing meth- 
ods for the DREAM4 10-gene networks. For the DREAM4 
networks, we were able to add the Dynamic Bayesian 
Networks method, as implemented in the ebdbnet R 
package [30], to the comparison because the networks are 
small. ScanBMA again performed best among the meth- 
ods, particularly in terms of the areas under the ROC and 
precision-recall curves, even though no external infor- 
mation was available. However, the extent of ScanBMAs 
superiority to other methods, notably LASSO, was smaller 
in this case than for the yeast data, reflecting the lack of 
external information. 

The precision-recall curves from the various methods 
are shown for the first of the five 10-gene networks in 




aracne 
cir 

mrnet 
lasso 

scanbma 20 
scanbma 3556 
random guessing 



0.00 



0.05 



0.10 
Recall 



0.15 



0.20 



Figure 1 Yeast precision-recall curves. Precision-Recall curves for different methods on the yeast data. ScanBMA was run using the ^-prior, 
transformed data, and informative prior with nvar=20 and 3556. 
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Table 3 Average CPU time per target gene of different 



methods for the yeast data 


Method 


Running Time Per Gene (s) 


LASSO 


4.1 


ARACNE 


70.4 


CLR 


7.9 


MRNET 


>500 


ScanBMA[20] 


0.04 


ScanBMA[3556] 


11.2 


iBMA[20] 


0.08 


iBMAt3556] 


85 



ScanBMA was run with g-prior, transformed data, informative prior. Superscript 
indicates value of nvar. 



to reduce spurious correlations, and uses Zellner's ^-prior 
for model parameters, with^ estimated from the data. We 
have introduced a new algorithm, ScanBMA, to search the 
model space efficiently. Our method infers compact net- 
works with higher precision than the competing methods 
we have assessed, important features for further analysis 
in searching for new regulatory relationships. 

We found that our method outperformed previous 
methods as well as LASSO and mutual information meth- 
ods on yeast time-series data. In addition, our method 
performed comparably to competing methods, including 
Dynamic Bayesian Networks, on simulated data from the 
DREAM4 challenge, even in the absence of prior informa- 
tion. The networks from ScanBMA are also similar in size 
to the target networks. 

Methods 

Bayesian model averaging 

In regression-based methods for network inference, we 
infer regulators (parent nodes) for each target gene. 
Hence, network inference can be formulated as a series 
of variable selection problems. We use the following mul- 
tiple linear regression model for network inference from 
time-series data: 



Table 4 Average performance of different methods on the 
DREAM4 10-gene networks 



Method 


Precision 


AUROC 


AUPRC 


TP 


FP 


LASSO 


0.190 


0.731 


0.487 


62 


265 


ebdbnet 


0.509 


0.704 


0.438 


28 


27 


ARACNE 


0.304 


0.668 


0.388 


35 


80 


CLR 


0.215 


0.681 


0.397 


50 


183 


MRNET 


0.215 


0.709 


0.409 


53 


193 


ScanBMA 


0.432 


0.740 


0.505 


35 


46 



ScanBMA was run with the original data. The true positive (TP) and false positive 
(FP) columns are totaled across all 5 networks. There are 71 true edges across the 
5 networks. 



heH 

where Xi^t,s is the expression level for gene i at time t for 
strain or replicate 5, H is the set of potential regulators, 

and £i^t,s A/^(0, cTg) is the error term, for / = 1, ^ = 
2, . . . , r and 5 = 1, . . . , 5. We are particularly interested in 
whether fihi ^ 0, indicating a regulatory relationship from 
gene h to gene i. 

One difficulty is that for gene network time series data 
there are typically far more potential regulators than 
observations. To address this problem, as well as to take 
into account uncertainty in model selection, we use BMA 
to obtain the posterior probability that each regulator is 
in the model [13,14]. BMA takes model uncertainty into 
account by averaging over the posterior distributions of a 
quantity of interest based on multiple models, weighted 
by their posterior model probabilities. Thus the posterior 
probabiUty that 7^ 0, also called the posterior inclusion 
probability of fihh is 

K 

?r(fihi ^ 0|D) = ^^^hi ^ 0|AM^)Pr(M^|D), 

where V\i(Mk\D) oc f p{D\Oj,,Mk)p(Ok\Mk)dOk is the pos- 
terior probability of model given data A p(D\Oj(yMk) 
is the likelihood of the parameter vector % of model 
M/^y the prior probability of model M/^ is Pr(M/^) a 

UheM,, ^hi Whmk ~ P^^^^ probability that 

gene h regulates gene /, and the models considered are 
denoted by Mi, . . . , Mk. 

An additional issue is that the number of possible mod- 
els is too large to enumerate in a reasonable amount of 
time. MCMC approaches have been applied where the 
number of potential regulators is large [31], but they are 
expensive computationally. Iterative BMA has also been 
applied to large genetics datasets, but the number of 
regulators was restricted prior to analysis [18,32]. 

Data transformations 

One concern when identifying regulatory relationships 
among genes is that there is a great deal of variation in 
gene expression levels that does not come from these 
interactions. For example, in the yeast data, many of the 
genes experience a sharp change in expression level over 
the first few time points caused by the application of 
the drug. This common trajectory is not important for 
inference and can produce many large correlations that 
do not correspond to actual interactions. In addition, we 
have found that removing the effect of a gene on itself 
can improve inference. By doing this we are removing 
excess variation in order to gain accuracy in inferring 
relationships. 
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Figure 2 10-gene Precision-Recall curves. Precision-Recall curves for various methods on network 1 of the 1 0-gene networks from the DREAM4 
competition. This network has 1 5 true edges. 




True Network ScanBMA Network 




LASSO Network ebdbnet Network 




ARACNE Network MRNET Network 

Figure 3 DREAIVI4 10-gene network visual comparison. 



In light of these observations, we transform the data 
in two steps to remove these extra sources of variation. 
First, we use time-adjusted data by subtracting the mean 
expression level for each gene / at timepoint t across 
strains: 

where Xi^t,- = Yls=i^i>t,s' This removes the overall 
effect of the drug. 

Second, we take the residuals from regressing the gene 
on itself at the previous timepoint: 

where ai is the regression coefficient in the simple linear 
regression model of the expression level of gene / at time t 
on its expression level at time t — 1: 

^i,t,s = (^i^i,t-l,s + ^i,t,si 



Table 5 Average performance of different methods on the 
DREAM4 100-gene networks 



Method 


Precision 


AUROC 


AUPRC 


TP 


FP 


LASSO 


0.035 


0.643 


0.073 


571 


15757 


ebdbnet 


0.054 


0.643 


0.043 


182 


3201 


ARACNE 


0.114 


0.589 


0.106 


208 


1621 


CLR 


0.035 


0.699 


0.123 


678 


18669 


MRNET 


0.035 


0.701 


0.130 


689 


18784 


ScanBMA 


0.153 


0.657 


0.101 


193 


1062 



ScanBMA run with original data. The true positive (TP) and false positive (FP) 
columns are totaled across all 5 networks. There are 1 ,024 true edges across the 
5 networks 
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Figure 4 100-gene Precision-Recall curves. Precision-Recall curves for various methods on network 1 of the 1 00-gene networks from the 


DREAM4 competition. 







where 8^,8 ^(0? orf )♦ Then a/ is the least squares esti- 
mate of ai based on all S(T — 1) relevant observations for 
gene /. 

Specifying prior distributions 

An important feature of our method is a way to incorpo- 
rate external information. The Bayesian approach requires 
the specification of prior distributions and so includes this 
directly. 

BMA requires two types of prior information: the prior 
edge probability, tt/^/, that potential regulator h regulates 
target gene /, for each h and /, and the prior distribution of 
the parameter vector for each model considered. 

For the prior edge probabilities tt/^/, we considered two 
different prior distributions. The first is based on the 
empirical finding of Guelzim et al. [33] that each tar- 
get gene is regulated by a small number of transcription 
factors, estimated as 2.76 per gene [19]. We implement 
this by setting tt/^/ = 2.76/6000 for all h and /. This 
prior distribution does not incorporate any gene-specific 
information. We call it the Guelzim prior. 

The second prior edge distribution we consider is that 
of Lo et al. [19], which uses external information in 
the form of expression data, binding-site data, curated 
literature and other sources to come up with a prior 
probability for each individual possible regulatory rela- 
tionship. This leads to values of tt/^/ specific to each 
regulator-target pair. We refer to this as the informa- 
tive prior. Integration of multiple information sources 
has been shown to be beneficial in network construction 
[34,35]. 

As a prior for the model parameters, corresponding to 
the strengths of the relationships, we use Zellner s ^-prior 



[27], as in [36]. The prior distribution of the parameter 
vector (i^i^l ^f^l o-2[^]) of model is 



2[k] 



where X]^ is the design matrix for model and ^ > 0 
controls the prior variance of the regression parameters. 
This prior yields an analytic form for the posterior model 
probabilities Pr(M/^ ID), namely 

2\og(p{Mu\D)/pm\D)) 

= {n-l) log(l +^(1 - Rl)) -in- dk - 1) log(l + g) 

where Mq is the null model with no regulators, is the 
number of regulators present in model Af/^, and i?^ is 
the value for M^, This is an alternative to using BIG 
to approximate the posterior model probabilities, as has 
been done previously [32,37]. 

The parameter g controls the expected size of the 
regression parameters Phu and is approximately equal to 
the prior variance of Phil'^^(Phi)y where Phi is the OLS 
estimator. This suggests that g should be at least 1, oth- 
erwise the individual fi^i^ would be expected to be nearly 
indistinguishable from the noise even under ideal con- 
ditions. Also, the effective number of data points in the 
prior is n/g. Using g = n corresponds to a unit infor- 
mation prior and yields similar results to using the BIG 
approximation. Raftery [38] argued that the prior should 
be no more spread out than the unit information prior, 
suggesting a choice of ^ in the range 1 < g < n. 
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We estimate^ from the data by maximum marginal like- 
lihood, where the likelihood is summed over the model 
space. We do this using an EM algorithm [39,40], where 
the "missing data" are the y/^/ s indicating whether regula- 
tor gene h is in the model for target gene /. First, we run 
BMA with a selected value of gy and from this we obtain 
the posterior probabilities of the models used. We then 
maximize 

K 

J2 MMk\D) {(n - 4 - 1) logd + g) 

k=i 

-(n-l)log[l+g(l+Rl)]) 

with respect to g. We use the new value of g in the next 
iteration of ScanBMA. We do this until convergence. 

Searching the model space using ScanBMA 

To find the models to be included in the BMA summa- 
tion, we use the Occam's window principle, according to 
which models that are substantially worse than the best 
model found (by a factor of C in posterior probability) are 
discarded [28]. We used C = 100, based on an extensive 
review of conventional standards of scientific evidence. To 
find the models in Occam's window, we propose a new, 
fast algorithm called ScanBMA. 

The main idea of ScanBMA is to keep an active set of 
models around which to search. All models which can be 
created by adding or removing a single predictor from 
those in the active set are then evaluated and, if they fall 
within Occam's window of the current best model, are 
added to the active set to expand the search. The method 
is outlined in Algorithm 1. 



Algorithm 1: ScanBMA 

Initialize Mkeep^Mnext = {} 

Initialize Ai active = {null model}, bestScore = 0 
while M active ^ot empty do 
for model 

n^new ^n 

^eighbonsOiiM active) do 
mScore = EvaluateModelScore(m«ew) 
if mScore in OccamsWindow(l7e5^5coreJ then 

add mnew Mnext 

bestScore = BestModelScore(Z?e5^5core, 
mScore) 
end 
end 

Trim models from A4 keep according to bestScore 
Add good models from Mactive to Mkeep 
Mactive = good models from Mnext 
end 

return Mkeep 



Because ScanBMA does not average over every model in 
the model space, the algorithm yields posterior inclusion 
probabilities of either 100% or 0% for many regulators. 
These extreme posterior inclusion probabilities are only 
approximations, and we can refine them. To estimate the 
posterior inclusion probability of a predictor with an 
approximate posterior probability of 100%, we calculate 
the ratio 0-h,i of the posterior probability of the best 
model to that of the best model with the predictor 
removed. We then approximate the posterior probability 
of predictor h by 0-h,i/(l + 0-h,i)y which will be less than 
100%. 

Similarly, to estimate the posterior inclusion probability 
of a predictor Xpi with approximate posterior probability 
0%, we compute Oh,u the ratio of the posterior probability 
of the best model to that of the best model with the predic- 
tor Xh added. Then the approximate posterior inclusion 
probability of predictor /z is (1 + Oh,i)~^, which will be 
greater than 0%. 

This post-processing step yields a unique ordering 
among the predictors, which is useful in evaluation. 

A variant of ScanBMA that we found to greatly improve 
computational efficiency without degrading performance 
is to restrict the number, nvar, of potential regulators h 
considered for each target gene / to those with the high- 
est prior probabilities, tt/^/. Guelzim et al. [33] found that 
the number of transcription factors per gene has approx- 
imately an exponential distribution with mean 2.76; they 
did not find any of the target genes that they considered 
to have more than 13 regulators. As a result, we con- 
sider nvar = 20, which leaves some margin over the 
Guelzim maximum. We compare this with considering all 
genes with observed variation in expression as potential 
regulators, which amounts to setting nvar = 3556. 

Data 

To validate our method, we applied it to time-series data 
from a gene expression experiment on yeast as well as 
simulated time-series datasets from the DREAM4 compe- 
tition. 

The yeast data come from an experiment on 97 strains of 
yeast crossed from two parent strains [18] . Each strain was 
subjected to a treatment of the macrolide drug rapamycin, 
chosen to cause changes in gene transcription across 
the genome. Gene expression levels were measured for 
approximately 6,000 genes every 10 minutes from 0 to 
50 minutes after the administration of the drug, using 
Affymetrix microarrays. The data were then filtered to 
remove genes that did not show significant variation over 
the measurements, leaving 3,556 genes. 

The yeast data can be represented as a three- 
dimensional array consisting of 3,556 genes, 97 segregants 
and 6 time points. There are no replicates in these data. 
However, the segregant axis and the time axis capture the 



Young etal. BMC Systems Biology 2014, 8:47 
http://www.biomedcentral.eom/1752-0509/8/47 



Page 9 of 1 1 



genetic and temporal variations. These yeast data are pub- 
licly available from ArrayExpress with accession number 
E-MTAB-412. 

The simulated DREAM4 In Silico Network Challenge 
[41-43] provided 5 networks each of size 10 and 100 genes 
[44]. Time-series data for each network are produced by 
artificially perturbing a portion of the genes in the net- 
work and simulating the response of the network over 
time. 

Specifically, the size 10 DREAM data consist of 10 
genes, 21 time points and 5 replicates. The size 100 
DREAM data consist of 100 genes, 21 time points, and 10 
replicates. 

Since our focus is on time-series data, we did not use 
the other data sources provided by the DREAM4 chal- 
lenge. In particular, the DREAM4 challenge provided the 
results of simulated gene knock-out experiments for all 
genes, and in fact the winning entry in the competition 
used only the knock-out data and ignored the time series 
data [45] . In practice, however, this is unrealistic, since it 
is typically feasible to do knock-out experiments for only a 
small proportion of the genes. Time series data do have the 
potential to provide some information about all potential 
regulatory relationships, and our goal is to develop meth- 
ods for doing this, so we have ignored the knock-out data 
here. 

Competing methods 

To evaluate the performance of our method, we compared 
it with LASSO [9], as well as with the mutual information 
methods MRNET, CLR and ARACNE. LASSO is a vari- 
able selection method based on a linear regression model, 
and thus can be used as an alternative to BMA. It is known 
for being computationally efficient. 

We used the implementation of LASSO in the glmnet 
package in R [12], with the shrinkage penalty parameter, 
sometimes denoted by A, chosen by cross-validation. 

Mutual information methods have been used exten- 
sively in identifying relationships among genes [21-23]. 
Since mutual information methods are non-directional, 
we used a modified version inspired by [46], including 
the response as the first column of the matrix given to 
the mutual information method, and the predictors as the 
other columns. We then took the first column from the 
resulting mutual information matrix as the measure of 
the directed relationship from the predictors to the target 
gene. MRNET, CLR and ARACNE are different methods 
for using the mutual information to infer a weighted adja- 
cency matrix between genes. All are implemented in the 
minet package in R [47]. 

We have also investigated the possibility of using 
Dynamic Bayesian Networks (DBN). Methods based on 
DBNs have been implemented in several R packages. 
These include the GeneNet package [48], but this is not 



designed for multiple time-series and so was not a compa- 
rable method on our datasets. The ebdbnet (Empirical 
Bayes Estimation of Dynamic Bayesian Networks) R pack- 
age [30], and the GIDBN R package [49], do allow for 
multiple time-series, but they are not designed to han- 
dle networks of the size of the yeast data. They are more 
appropriate for networks of sizes up to a few hundred 
genes. We were able to use the ebdbnet package for the 
much smaller DREAM4 networks. 

YEASTRACT: Assessment of yeast data 

The YEASTRACT database [29] is a literature-curated 
repository of regulatory relationships between known 
transcription factors and target genes in yeast, based on 
more than 1,300 literature references. Among the 3,556 
genes in our filtered yeast time- series data, this database 
contains documented regulatory relationships for 127 
transcription factors (TFs), encompassing a total of 17,173 
edges. We therefore evaluate only inferred relationships 
from these 127 transcription factors, and inferred reg- 
ulatory relationships from other genes are not used in 
evaluation. 

Availability of supporting data 

The yeast time series data are publicly available from 
ArrayExpress http://www.ebi.ac.uk/arrayexpress [50] with 
accession number E-MTAB-412. The DREAM4 data 
are publicly available from http://wiki.c2b2.columbia.edu/ 
dream/index.php?title=D4c2 [44]. 
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